FASTQ | 1000 Genomes

What do the names of your fastq files mean?

Answer:

Our sequence files are distributed in gzipped fastq format.

Our files are named with the SRA run accession, in the form E?SRR000000.filt.fastq.gz. All the reads in the file also carry this accession in their names. Files with _1 and _2 in their names come from paired-end sequencing runs. If there is also a file with no number in its name, it contains the fragments whose other end failed QC. The .filt in the name indicates that the data in the file has been filtered after retrieval from the archive. This filtering process is described in a README.
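As a rough illustration of the naming scheme, the sketch below splits a file name into its run accession and mate number. It assumes that accessions start with ERR or SRR followed by digits, which is inferred from the examples on this page rather than taken from a formal specification. The names FASTQ_NAME, FastqName and parse_fastq_name are hypothetical helpers, not part of any IGSR tool.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: an ERR/SRR run accession, an optional _1 or _2 mate
# suffix, then the fixed .filt.fastq.gz ending.
FASTQ_NAME = re.compile(r"^(?P<run>[ES]RR\d+)(?:_(?P<mate>[12]))?\.filt\.fastq\.gz$")


class FastqName(NamedTuple):
    run: str             # run accession, e.g. ERR001268
    mate: Optional[int]  # 1 or 2 for paired files, None for the unnumbered file


def parse_fastq_name(filename: str) -> FastqName:
    """Split a filtered fastq file name into run accession and mate number."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a recognised filtered fastq name: {filename}")
    mate = match.group("mate")
    return FastqName(match.group("run"), int(mate) if mate else None)
```

A file with no mate number is reported with mate set to None, so callers can treat it as the file of unpaired fragments.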

Where can I find phase3 alignment BAM files and read fastq files on the ftp site?

Answer:

You can find all the 1000 Genomes phase 3 BAM and fastq files in:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data

All BAM files from IGSR can be found in:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections

Why is there more than one set of fastq files associated with an individual?

Answer:

Many of our individuals have multiple fastq files. This is because many of our individuals were sequenced using more than one run of a sequencing machine.

Each set of files named like ERR001268_1.filt.fastq.gz, ERR001268_2.filt.fastq.gz and ERR001268.filt.fastq.gz represents all the sequence from one sequencing run.

When an individual has many files with different run accessions (e.g. ERR001268), it was sequenced multiple times. This can be within the same experiment, since some centres used multiplexing to have better control over their coverage levels for the low coverage sequencing, or because the individual was sequenced using different protocols or on different platforms.
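To see how many runs a set of downloaded files covers, the files can be grouped by run accession. The following is a minimal sketch that relies only on the naming convention described above; group_by_run is a hypothetical helper, and SRR000000 in the example is a placeholder accession rather than a real run.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_run(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Group fastq file names by the run accession at the start of each name."""
    runs: Dict[str, List[str]] = defaultdict(list)
    for name in filenames:
        # Everything before the first "." is the accession plus an optional
        # _1 or _2 mate suffix; drop the suffix to get the accession alone.
        stem = name.split(".", 1)[0]
        run = stem.split("_", 1)[0]
        runs[run].append(name)
    return dict(runs)


if __name__ == "__main__":
    files = [
        "ERR001268_1.filt.fastq.gz",
        "ERR001268_2.filt.fastq.gz",
        "ERR001268.filt.fastq.gz",
        "SRR000000_1.filt.fastq.gz",  # placeholder accession
        "SRR000000_2.filt.fastq.gz",  # placeholder accession
    ]
    for run, members in group_by_run(files).items():
        print(run, len(members))
```

Each key in the result corresponds to one sequencing run, so an individual with several keys was sequenced in several runs.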

For a full description of the sequencing conducted for the project, please look at our sequence.index file.

Why is the sequence data distributed in 2 or 3 files labelled SRR_1, SRR_2 and SRR?

Answer:

We distribute the fastq files for our paired-end sequencing in 2 files: mate1 is found in the file labelled _1 and mate2 in the file labelled _2. Files without a number in their name contain single-ended reads. This can be for two reasons. First, some sequencing early in the project was single-ended. Second, we filter our fastq files as described in our README, and if one read of a pair is rejected, the other read is placed in the single file.
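The way filtering produces a third file can be sketched as follows. This is an illustration of the routing described above, not the project's actual filtering code: the filter criteria are defined in the README and are represented here by a caller-supplied passes_filter function, which is an assumption. The Read type and split_filtered_pairs are likewise hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

# Simplified stand-in for a fastq record: (name, sequence, quality).
Read = Tuple[str, str, str]


def split_filtered_pairs(
    pairs: Iterable[Tuple[Read, Read]],
    passes_filter: Callable[[Read], bool],
) -> Tuple[List[Read], List[Read], List[Read]]:
    """Route read pairs into _1, _2 and unnumbered (single) outputs."""
    mate1: List[Read] = []
    mate2: List[Read] = []
    single: List[Read] = []
    for first, second in pairs:
        ok_first, ok_second = passes_filter(first), passes_filter(second)
        if ok_first and ok_second:
            # Both mates survive: they stay paired in the _1 and _2 files.
            mate1.append(first)
            mate2.append(second)
        elif ok_first:
            # Only one mate survives: it goes to the unnumbered file.
            single.append(first)
        elif ok_second:
            single.append(second)
        # If both mates are rejected, nothing is kept for this pair.
    return mate1, mate2, single
```

Keeping the _1 and _2 lists the same length and in the same order means the nth read in one file is always the mate of the nth read in the other.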

IGSR: The International Genome Sample Resource

Supporting open human variation data
